Customer Segmentation in the US

In this project, the goal is to identify households that have difficulty getting credit

In [1]:
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats.mstats import trimmed_var
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

Preparing the Data¶

Importing¶

The dataset tracks behaviors relating to the ways households earn, save, and spend money in the United States

In [2]:
# Reading the gzip-csv file into a DataFrame
df = pd.read_csv("SCFP2019.csv.gz")
print("df shape:", df.shape)
df.head()
df shape: (28885, 351)
Out[2]:
YY1 Y1 WGT HHSEX AGE AGECL EDUC EDCL MARRIED KIDS ... NWCAT INCCAT ASSETCAT NINCCAT NINC2CAT NWPCTLECAT INCPCTLECAT NINCPCTLECAT INCQRTCAT NINCQRTCAT
0 1 11 6119.779308 2 75 6 12 4 2 0 ... 5 3 6 3 2 10 6 6 3 3
1 1 12 4712.374912 2 75 6 12 4 2 0 ... 5 3 6 3 1 10 5 5 2 2
2 1 13 5145.224455 2 75 6 12 4 2 0 ... 5 3 6 3 1 10 5 5 2 2
3 1 14 5297.663412 2 75 6 12 4 2 0 ... 5 2 6 2 1 10 4 4 2 2
4 1 15 4761.812371 2 75 6 12 4 2 0 ... 5 3 6 3 1 10 5 5 2 2

5 rows × 351 columns

Explore¶

The dictionary of this dataset can be found at https://sda.berkeley.edu/sdaweb/docs/scfcomb2019/DOC/hcbkfx0.htm

The goal of this project is to identify households that have difficulty getting credit, so an interesting column to keep an eye on is:

TURNFEAR | Household has been turned down for credit or feared being denied credit in the past 5 years

In [3]:
# Using a mask to subset df to only households that have been turned down, or feared being turned down, for credit
mask = df["TURNFEAR"] == 1
df_fear = df[mask]
print("df_fear shape:", df_fear.shape)
df_fear.head()
df_fear shape: (4623, 351)
Out[3]:
YY1 Y1 WGT HHSEX AGE AGECL EDUC EDCL MARRIED KIDS ... NWCAT INCCAT ASSETCAT NINCCAT NINC2CAT NWPCTLECAT INCPCTLECAT NINCPCTLECAT INCQRTCAT NINCQRTCAT
5 2 21 3790.476607 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2
6 2 22 3798.868505 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 3 2 2
7 2 23 3799.468393 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2
8 2 24 3788.076005 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2
9 2 25 3793.066589 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2

5 rows × 351 columns

Another useful feature will be age. The only age-related columns are:

AGE | Age of reference person
AGECL | Age group of the reference person

When working with segmentation, a good approach is to give priority to group-related features. Although AGECL is numeric, it is in fact a categorical feature
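As a sketch of why AGECL behaves like a category rather than a number: it is effectively AGE binned into ordered groups. The ages and bin edges below are hypothetical, inferred from the codebook labels rather than taken from the data:

```python
import pandas as pd

# Hypothetical ages; the bin edges are an assumption inferred from the codebook
ages = pd.Series([28, 40, 50, 60, 70, 80])
bins = [0, 34, 44, 54, 64, 74, 200]
codes = [1, 2, 3, 4, 5, 6]  # AGECL-style codes: ordered labels, not true quantities

agecl = pd.cut(ages, bins=bins, labels=codes)
print(agecl.tolist())  # [1, 2, 3, 4, 5, 6]
```

Arithmetic on such codes (e.g. averaging them) would be meaningless, which is why they are treated as categories.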

In [4]:
# Available age groups
age_groups = df_fear["AGECL"].unique()
print("Age Groups:", age_groups)
Age Groups: [3 5 1 2 4 6]
In [5]:
df_fear["AGECL"].head(10)
Out[5]:
5      3
6      3
7      3
8      3
9      3
110    5
111    5
112    5
113    5
114    5
Name: AGECL, dtype: int64

Looking at the dictionary, there is a description of what each number represents:

[Figure: AGECL category codes from the SCF codebook]

In [6]:
# Replacing numeric categories with readable ones
agecl_dict = {
    1: "Under 35",
    2: "35-44",
    3: "45-54",
    4: "55-64",
    5: "65-74",
    6: "75 or Older",
}

age_cl = df_fear["AGECL"].replace(agecl_dict)
age_cl.head(10)
Out[6]:
5      45-54
6      45-54
7      45-54
8      45-54
9      45-54
110    65-74
111    65-74
112    65-74
113    65-74
114    65-74
Name: AGECL, dtype: object
In [7]:
# Value counts to get a sense of the age distribution
age_cl_value_counts = age_cl.value_counts(normalize=True)

# Bar plot of `age_cl_value_counts`
age_cl_value_counts.plot(kind="bar")
plt.xlabel("Age Group")
plt.ylabel("Frequency (%)")
plt.title("Credit Fearful: Age Groups");

The AGECL feature can give some insight, but looking at AGE is also a good idea when exploring the dataset

In [8]:
# Plot histogram of "AGE"
df_fear["AGE"].hist(bins=10)
plt.xlabel("Age")
plt.ylabel("Frequency (count)")
plt.title("Credit Fearful: Age Distribution");

What is noticed is that although the largest group is Under 35, the most common ages seem to be between 30 and 40 years old

Other useful features relate to income:

INCOME | Total amount of income of household, 2019 dollars
INCCAT | Income percentile groups

To better visualize the information in the data, it is ideal to create a DataFrame that shows the normalized frequency of income categories for both the credit-fearful and non-credit-fearful households in the dataset

In [9]:
# Income percentile groups dictionary
inccat_dict = {
    1: "0-20",
    2: "21-39.9",
    3: "40-59.9",
    4: "60-79.9",
    5: "80-89.9",
    6: "90-100",
}

# Data filtering and construction
df_inccat = (
    df["INCCAT"]
    .replace(inccat_dict)
    .groupby(df["TURNFEAR"])
    .value_counts(normalize=True)
    .rename("frequency")
    .sort_values(ascending=False)
    .to_frame()
    .reset_index()
)

df_inccat
Out[9]:
TURNFEAR INCCAT frequency
0 0 90-100 0.297296
1 1 0-20 0.288125
2 1 21-39.9 0.256327
3 1 40-59.9 0.228856
4 0 60-79.9 0.174841
5 0 40-59.9 0.143146
6 0 0-20 0.140343
7 0 21-39.9 0.135933
8 1 60-79.9 0.132598
9 0 80-89.9 0.108441
10 1 90-100 0.048886
11 1 80-89.9 0.045209
In [10]:
# Create bar chart of "df_inccat"
sns.barplot(
    x="INCCAT",
    y="frequency",
    hue="TURNFEAR",
    data=df_inccat,
    order=inccat_dict.values()
)

# Update axis and title details
plt.xlabel("Income Category")
plt.ylabel("Frequency (%)")
plt.title("Income Distribution: Credit Fearful vs. Non-fearful")

# Put the legend out of the figure
plt.legend(title='TURNFEAR', bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.);

Comparing the income categories across the fearful and non-fearful groups, it is noticeable that credit fearful households are much more common in the lower income categories. In other words, the credit fearful have lower incomes.

So far, data exploration shows that among the people who responded that they were worried about being approved for credit after having been denied in the past five years, the young and the low-income made up a plurality. This makes sense: young people tend to make less money and rely more heavily on credit to get their lives off the ground, so having been denied credit makes them more anxious about the future.

Establishing relationships among the variables and building some correlation matrices is a good way to make more sense of the data

ASSET | Total value of assets held by household, 2019 dollars
HOUSES | Total value of primary residence of household, 2019 dollars

In [11]:
# Correlation ASSET X HOUSES in the whole dataset
asset_house_corr = df["ASSET"].corr(df["HOUSES"])
print("Asset Houses Correlation:", asset_house_corr)
Asset Houses Correlation: 0.5198273544779253

A moderate positive correlation, which is expected since, usually, the most valuable asset a household owns is its property

In [12]:
# Correlation ASSET X HOUSES in the "df_fear" dataset
asset_house_corr = df_fear["ASSET"].corr(df_fear["HOUSES"])
print("Credit Fearful Asset Houses Correlation:", asset_house_corr)
Credit Fearful Asset Houses Correlation: 0.583287973597916

The correlation is even greater when using the fearful records

A few other features are also interesting to investigate

DEBT | Total value of debt held by household, 2019 dollars
EDUC | Highest completed grade by reference person

In [13]:
# Selecting all important features discovered so far
cols = ["ASSET", "HOUSES", "INCOME", "DEBT", "EDUC"]

# Checking their correlation on the whole dataset
corr = df[cols].corr()
corr.style.background_gradient(axis=None)
Out[13]:
  ASSET HOUSES INCOME DEBT EDUC
ASSET 1.000000 0.519827 0.622429 0.261250 0.116673
HOUSES 0.519827 1.000000 0.247852 0.266661 0.169300
INCOME 0.622429 0.247852 1.000000 0.114646 0.069400
DEBT 0.261250 0.266661 0.114646 1.000000 0.054179
EDUC 0.116673 0.169300 0.069400 0.054179 1.000000
In [14]:
# Checking their correlation in the "df_fear" dataset
corr = df_fear[cols].corr()
corr.style.background_gradient(axis=None)
Out[14]:
  ASSET HOUSES INCOME DEBT EDUC
ASSET 1.000000 0.583288 0.722074 0.474658 0.113536
HOUSES 0.583288 1.000000 0.264099 0.962629 0.160348
INCOME 0.722074 0.264099 1.000000 0.172393 0.133170
DEBT 0.474658 0.962629 0.172393 1.000000 0.177386
EDUC 0.113536 0.160348 0.133170 0.177386 1.000000

There are important differences. The relationship between DEBT and HOUSES is positive in both datasets, but while the coefficient for the whole dataset is fairly weak at 0.26, the same figure for df_fear is 0.96

[Figure: EDUC category codes from the SCF codebook]

To get a sense of the data, it is ideal to create a DataFrame that shows the normalized frequency of education categories for both the credit-fearful and non-credit-fearful households in the dataset. This will be similar in format to df_inccat, but focused on education; since the label descriptions are quite long, I will not be using a dictionary this time

In [15]:
# Data filtering and construction
df_educ = (
    df["EDUC"]
    .groupby(df["TURNFEAR"])
    .value_counts(normalize=True)
    .rename("frequency")
    .sort_values(ascending=False)
    .to_frame()
    .reset_index()
)
df_educ.head()
Out[15]:
TURNFEAR EDUC frequency
0 1 8 0.293532
1 0 12 0.257481
2 1 9 0.212200
3 0 8 0.192029
4 0 13 0.149823
In [16]:
# Create bar chart of "df_educ"
sns.barplot(
    x="EDUC",
    y="frequency",
    hue="TURNFEAR",
    data=df_educ
)
plt.xlabel("Education Level")
plt.ylabel("Frequency (%)")
plt.title("Educational Attainment: Credit Fearful vs. Non-fearful");

In this plot, it is noticeable that a much higher proportion of credit-fearful respondents have only a high school diploma, while university degrees are more common among the non-credit-fearful. In fact, the trend only reverses once the participant has at least a Bachelor's degree (Education Level 12)

Investigating DEBT x ASSET

In [17]:
# Creating a scatter plot of ASSET vs DEBT using the whole dataset
df.plot.scatter(x="DEBT", y="ASSET");
In [18]:
# Create scatter plot of ASSET vs DEBT using "df_fear"
df_fear.plot.scatter(x="DEBT", y="ASSET");

The relationship in the df_fear graph is flatter than the one in the df graph; the two are clearly different

Investigating DEBT x HOUSES

In [19]:
# Create scatter plot of HOUSES vs DEBT using the whole dataset
df.plot.scatter(x="DEBT", y="HOUSES");
In [20]:
# Create scatter plot of HOUSES vs DEBT using "df_fear"
df_fear.plot.scatter(x="DEBT", y="HOUSES");

There are outliers, but the relationship is clear enough: the df_fear graph shows an almost perfect linear relationship, while the df graph shows something a little more muddled. It is also noticeable that the data points in the df_fear graph form several little groups

After all the insights obtained through the data analysis, it might be time to start building the segmentation model. However, 351 features is quite a large number. A good practice for selecting clustering features is to determine which numerical features have the largest variance

In [21]:
# Checking the variance and keeping the 10 features with the largest values
top_ten_var = df.var().sort_values().tail(10)
top_ten_var
Out[21]:
EQUITY      5.674332e+14
FIN         1.054968e+15
ACTBUS      2.246149e+15
KGBUS       2.398037e+15
KGTOTAL     2.805429e+15
BUS         3.184224e+15
NHNFIN      3.606887e+15
NFIN        3.764771e+15
NETWORTH    6.142046e+15
ASSET       6.275796e+15
dtype: float64
In [22]:
# Create horizontal bar chart of `top_ten_var`
fig = px.bar(
    x=top_ten_var,
    y=top_ten_var.index,
    title="SCF: High Variance Features"
)
fig.update_layout(xaxis_title="Variance", yaxis_title="Feature")
fig.show()
In [23]:
df.ASSET.hist();
In [24]:
df.NETWORTH.hist();

The distributions are massively right-skewed because of the huge outliers on the right side, so a treatment is definitely in order

I will be creating a wrangle function to help organize the code and clean the data

In [25]:
def wrangle(filepath):

    """
    Read a gzip-compressed CSV file into a DataFrame.

    Returns the treated data as a pandas DataFrame.

    Parameters
    ----------
    filepath : str
        Path of the gzip-compressed CSV file
    """

    # Reading
    df = pd.read_csv(filepath)
    # Masking the credit-fearful records, keeping only NETWORTH below $2M
    mask = (df["TURNFEAR"] == 1) & (df["NETWORTH"] < 2_000_000)
    # Applying the mask
    df = df[mask]
    return df
In [26]:
df_fear = wrangle("SCFP2019.csv.gz")
print(df_fear.shape)
df_fear.head()
(4418, 351)
Out[26]:
YY1 Y1 WGT HHSEX AGE AGECL EDUC EDCL MARRIED KIDS ... NWCAT INCCAT ASSETCAT NINCCAT NINC2CAT NWPCTLECAT INCPCTLECAT NINCPCTLECAT INCQRTCAT NINCQRTCAT
5 2 21 3790.476607 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2
6 2 22 3798.868505 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 3 2 2
7 2 23 3799.468393 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2
8 2 24 3788.076005 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2
9 2 25 3793.066589 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2

5 rows × 351 columns

In [27]:
df_fear.NETWORTH.hist();
In [28]:
# Create a boxplot of `NETWORTH`
fig = px.box(
    data_frame=df_fear,
    x="NETWORTH",
    title="Distribution of Household Net Worth"
)
fig.update_layout(xaxis_title="Value [$]")
fig.show()

Although much better, there are still some values that might distort the total variance of a feature. That is why I will be using the trimmed variance, where extreme values are removed before the variance is calculated.
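To illustrate what trimming does, here is a toy example (not the SCF data): a single extreme value dominates the plain variance but is masked out before the trimmed variance is computed.

```python
import numpy as np
from scipy.stats.mstats import trimmed_var

# One huge outlier among otherwise small values
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

plain = values.var()
# limits=(0.1, 0.1) masks the lowest 10% and highest 10% of values first
trimmed = trimmed_var(values, limits=(0.1, 0.1))

print(plain, trimmed)  # the trimmed figure is orders of magnitude smaller
```

This is the same `limits=(0.1, 0.1)` setting used in the cell below.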

In [29]:
# Calculate trimmed variance
top_ten_trim_var = df_fear.apply(trimmed_var, limits=(0.1, 0.1)).sort_values().tail(10)
top_ten_trim_var
Out[29]:
WAGEINC     5.550737e+08
HOMEEQ      7.338377e+08
NH_MORT     1.333125e+09
MRTHEL      1.380468e+09
PLOAN1      1.441968e+09
DEBT        3.089865e+09
NETWORTH    3.099929e+09
HOUSES      4.978660e+09
NFIN        8.456442e+09
ASSET       1.175370e+10
dtype: float64
In [30]:
# Create horizontal bar chart of "top_ten_trim_var"
fig = px.bar(
    x=top_ten_trim_var,
    y=top_ten_trim_var.index,
    title="SCF: High Variance Features"
)
fig.update_layout(xaxis_title="Trimmed Variance", yaxis_title="Feature")
fig.show()

There are a few things to notice in this plot. First, the variances have decreased a lot: the previous chart's x-axis extended past 6 quadrillion, while this one tops out near 12 billion. Second, the top 10 features have changed a bit; all the features relating to business ownership (ending in "BUS") are gone. Finally, there are big differences in variance from feature to feature. For example, the trimmed variance of WAGEINC is around 550 million, while that of ASSET is nearly 12 billion. In other words, these features have completely different scales, something that needs to be addressed before making good clusters.

All in all, the first five features seem to have the largest variance: the sixth is around half of the fifth, and beyond that they are all similar. For that reason, choosing the first five to build the model seems to be the optimal decision

In [31]:
high_var_cols = top_ten_trim_var.tail().index.to_list()
high_var_cols
Out[31]:
['DEBT', 'NETWORTH', 'HOUSES', 'NFIN', 'ASSET']

Split¶

In [32]:
# Creating the feature matrix X
X = df_fear[high_var_cols]
print("X shape:", X.shape)
X.head()
X shape: (4418, 5)
Out[32]:
DEBT NETWORTH HOUSES NFIN ASSET
5 12200.0 -6710.0 0.0 3900.0 5490.0
6 12600.0 -4710.0 0.0 6300.0 7890.0
7 15300.0 -8115.0 0.0 5600.0 7185.0
8 14100.0 -2510.0 0.0 10000.0 11590.0
9 15400.0 -5715.0 0.0 8100.0 9685.0

Build Model¶

Iterate¶

During EDA, a scale issue was noticed among the features. That issue can make it harder to cluster the data, so it must be fixed before further analysis. A good strategy is standardization

In [33]:
# Creating a DataFrame with the mean and standard deviation for all the features in X
X_summary = X.aggregate(["mean", "std"]).astype(int)
X_summary
Out[33]:
DEBT NETWORTH HOUSES NFIN ASSET
mean 72701 76387 74530 117330 149089
std 135950 220159 154546 239038 288166
In [34]:
# Instantiate transformer
ss = StandardScaler()

# Transform "X"
X_scaled_data = ss.fit_transform(X)

# Put "X_scaled_data" into DataFrame
X_scaled = pd.DataFrame(X_scaled_data, columns=X.columns)

print("X_scaled shape:", X_scaled.shape)
X_scaled.head()
X_scaled shape: (4418, 5)
Out[34]:
DEBT NETWORTH HOUSES NFIN ASSET
0 -0.445075 -0.377486 -0.48231 -0.474583 -0.498377
1 -0.442132 -0.368401 -0.48231 -0.464541 -0.490047
2 -0.422270 -0.383868 -0.48231 -0.467470 -0.492494
3 -0.431097 -0.358407 -0.48231 -0.449061 -0.477206
4 -0.421534 -0.372966 -0.48231 -0.457010 -0.483818

Now, all five features use the same scale. To verify, it is ideal to look at their mean and standard deviation

In [35]:
# Creating a DataFrame with the mean and standard deviation for all the features in X_scaled
X_scaled_summary = X_scaled.aggregate(["mean", "std"]).astype(int)
X_scaled_summary
Out[35]:
DEBT NETWORTH HOUSES NFIN ASSET
mean 0 0 0 0 0
std 1 1 1 1 1

Everything looks as expected, since standardization rescales all the features so that each has a mean of 0 and a standard deviation of 1.
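A minimal sketch of what StandardScaler does under the hood, on toy numbers: subtract the column mean and divide by the population standard deviation (ddof=0, the same convention sklearn uses).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])

manual = (x - x.mean()) / x.std()            # z-scores by hand
scaled = StandardScaler().fit_transform(x)   # same computation

print(np.allclose(manual, scaled))  # True
```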

Using a for loop to build and train a K-Means model for each n_clusters value from 2 to 12 (inclusive). Each time a model is trained, its inertia is calculated and appended to the list inertia_errors, and its silhouette score is calculated and appended to the list silhouette_scores.

In [36]:
# List of number of clusters
n_clusters = range(2, 13)
inertia_errors = []
silhouette_scores = []

# Add "for" loop to train model and calculate inertia, silhouette score
for k in n_clusters:
    # Build model
    model = make_pipeline(StandardScaler(), KMeans(n_clusters=k, random_state=42))
    # Train model
    model.fit(X)
    # Calculate inertia
    inertia_errors.append(model.named_steps["kmeans"].inertia_)
    # Calculate silhouette score
    silhouette_scores.append(silhouette_score(X, model.named_steps["kmeans"].labels_))

print("Inertia:", inertia_errors[:3])
print()
print("Silhouette Scores:", silhouette_scores[:3])
Inertia: [11028.058082607182, 7190.5263035753505, 5924.997726868028]

Silhouette Scores: [0.7464502937083215, 0.7044601307791996, 0.6962653079183132]

Using plotly express to create a line plot that shows the values of inertia_errors as a function of n_clusters:

In [37]:
# Creating line plot of `inertia_errors` vs `n_clusters`
fig = px.line(
    x=n_clusters,
    y=inertia_errors,
    title="K-Means Model: Inertia vs Number of Clusters"
)
fig.update_layout(
    xaxis_title="Number of Clusters [k]",
    yaxis_title="Inertia"
)
fig.show()

It is noticeable that the line starts to flatten out around 4 or 5 clusters
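One way to make the "flattening" less subjective is to look at the relative improvement each extra cluster buys. This is a sketch on a hypothetical inertia curve; only the first three values match the run above, the rest are made up for illustration:

```python
import numpy as np

# Hypothetical inertia for k = 2..7 (first three values match the run above)
inertia = np.array([11028.1, 7190.5, 5925.0, 5100.0, 4950.0, 4870.0])

# Fraction of inertia removed by each additional cluster
drops = 1 - inertia[1:] / inertia[:-1]
print(drops.round(3))
```

Once the drop falls to a few percent per extra cluster, adding more clusters stops paying for the added complexity.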

Using plotly express to create a line plot that shows the values of silhouette_scores as a function of n_clusters:

In [38]:
# Create a line plot of `silhouette_scores` vs `n_clusters`
fig = px.line(
    x=n_clusters,
    y=silhouette_scores,
    title="K-Means Model: Silhouette Score vs Number of Clusters"
)
fig.update_layout(
    xaxis_title="Number of Clusters", 
    yaxis_title="Silhouette Score"
)

fig.show()

This plot might be less straightforward, but putting its information together with the inertia plot, it seems like the best setting for n_clusters will be 4.

Building and training a new k-means model:

In [39]:
final_model = make_pipeline(
    StandardScaler(), 
    KMeans(n_clusters=4, random_state=42)
)
final_model.fit(X)
Out[39]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('kmeans', KMeans(n_clusters=4, random_state=42))])

Communicating the Results¶

Extracting the labels

In [40]:
labels = final_model.named_steps["kmeans"].labels_
print(labels[:5])
[0 0 0 0 0]
In [41]:
xgb = X.groupby(labels).mean()
xgb
Out[41]:
DEBT NETWORTH HOUSES NFIN ASSET
0 26551.075439 13676.153182 13745.637777 2.722605e+04 4.022723e+04
1 218112.818182 174713.441558 257403.246753 3.305884e+05 3.928263e+05
2 116160.779817 965764.155963 264339.449541 7.800611e+05 1.081925e+06
3 732937.575758 760397.575758 826136.363636 1.276227e+06 1.493335e+06

Using plotly express to create a side-by-side bar chart from xgb that shows, for each cluster in final_model, the mean of the features in X:

In [42]:
# Create side-by-side bar chart of `xgb`
fig = px.bar(
    xgb,
    barmode="group",
    title="Mean Household Finances by Cluster"
)

fig.update_layout(xaxis_title="Cluster", yaxis_title="Value [$]")
fig.show()

The clusters are based partially on NETWORTH, which means that the households in cluster 0 have the smallest net worth and the households in cluster 2 have the highest. Based on that, there are some interesting things to unpack.

The value of the debt carried by the people in cluster 0 is higher than the value of their houses, so their debt cannot be explained by a mortgage alone, assuming they own a home at all. Contrasting that with the other three clusters, everyone else's debt is lower than the value of their homes, suggesting that most of their debt is tied up in mortgages
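That reading can be checked numerically by taking the ratio of mean DEBT to mean HOUSES per cluster. The values below are copied (rounded) from the xgb table above:

```python
import pandas as pd

# Cluster means for DEBT and HOUSES, copied (rounded) from the xgb table
xgb_means = pd.DataFrame(
    {
        "DEBT":   [26551.1, 218112.8, 116160.8, 732937.6],
        "HOUSES": [13745.6, 257403.2, 264339.4, 826136.4],
    }
)

# A ratio above 1 means the cluster carries more debt than its homes are worth
debt_to_houses = xgb_means["DEBT"] / xgb_means["HOUSES"]
print(debt_to_houses.round(2))
```

Only cluster 0 comes out above 1; the other three clusters all sit below it.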

To represent the clusters in a plot, it is necessary to reduce the dimensionality of the data. This can be achieved with a model such as PCA

Creating a PCA transformer then using it to reduce the dimensionality of the data in X to 2:

In [43]:
# Instantiate transformer
pca = PCA(n_components=2, random_state=42)

# Transform "X"
X_t = pca.fit_transform(X)

# Put "X_t" into DataFrame
X_pca = pd.DataFrame(X_t, columns=["PC1", "PC2"])

print("X_pca shape:", X_pca.shape)
X_pca.head()
X_pca shape: (4418, 2)
Out[43]:
PC1 PC2
0 -221525.424530 -22052.273003
1 -217775.100722 -22851.358068
2 -219519.642175 -19023.646333
3 -212195.720367 -22957.107039
4 -215540.507551 -20259.749306
In [44]:
# Create scatter plot of `PC2` vs `PC1`
fig = px.scatter(
    data_frame=X_pca,
    x="PC1",
    y="PC2",
    color=labels.astype(str),
    title="PCA Representation of Clusters",
)
fig.update_layout(xaxis_title="PC1", yaxis_title="PC2")
fig.show()

The plot shows the four tightly grouped clusters that were made, each sharing some key features
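A worthwhile sanity check on the 2-D representation is pca.explained_variance_ratio_, which reports how much of the original variance the two components retain. The sketch below uses synthetic stand-in data (the real X comes from the SCF file) with five strongly correlated columns, much like the asset and debt features:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for X: five strongly correlated columns
rng = np.random.default_rng(42)
base = rng.normal(size=(200, 1))
X_toy = np.hstack(
    [w * base + rng.normal(scale=0.1, size=(200, 1)) for w in (1, 2, 3, 4, 5)]
)

pca = PCA(n_components=2, random_state=42)
pca.fit(X_toy)

# Fraction of total variance captured by PC1 and PC2
print(pca.explained_variance_ratio_)
```

When the ratios sum close to 1, the 2-D scatter plot is a faithful summary; if they summed to a small fraction, the apparent cluster separation could be an artifact of the projection.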